Abstract

Protein structure alignment is a fundamental problem in computational and structural biology.
While there have been many experimental/heuristic methods and empirical results, very
little is known regarding the algorithmic/complexity aspects of the problem, especially for protein
local structure alignment. A well-known measure of the similarity of two
polygonal chains is the Frechet distance; in protein-related
research, a related discrete Frechet distance has been used recently. In this paper, following
the recent work of Jiang, et al., we investigate the protein local structural alignment problem
using the bounded discrete Frechet distance. Given $m$ proteins (or protein backbones, which are
3D polygonal chains), each of length $O(n)$, our main results are summarized as follows.

• If the number of proteins, $m$, is part of the input (i.e., not bounded by a constant), then the
problem is NP-complete; moreover, under the bounded discrete Frechet distance it is NP-hard to
approximate the maximum size common local structure within a factor of $n^{1-\epsilon}$. These results
hold both when all the proteins are static and when translation/rotation are allowed.

• If the number of proteins, m, is a constant, then there is a polynomial time solution for 
the problem.

1 Introduction



As a famous distance measure in the field of abstract spaces, the Frechet distance was first defined
by Maurice Frechet a century ago [7]. Alt and Godau first used it to measure the similarity of
polygonal chains in 1992 [1]. It is well known that the Frechet distance between two two-dimensional
(2D) polygonal chains (polylines) can be computed in polynomial time [1] [2], even under
translation or rotation (though the running time is much higher) [3]. In three-dimensional space
(3D), Wenk showed that given two chains with total length $N$, the minimum Frechet distance
between them can be computed in $O(N^{3f+2} \log N)$ time, where $f$ is the number of degrees of
freedom for moving the chains [22]. So with translation alone ($f = 3$) this minimum Frechet distance
can be computed in $O(N^{11} \log N)$ time, and when both translation and rotation are allowed ($f = 6$)
the corresponding minimum Frechet distance can be computed in $O(N^{20} \log N)$ time. These results
can be generalized to any fixed dimension [22]. While computing (approximating) Frechet distance for surfaces is in
general NP-hard [H [12] , it is polynomially solvable for restricted surfaces [4] . 

In 1994, Eiter and Mannila defined the discrete Frechet distance between two polygonal chains
$A$ and $B$ (in any fixed dimension); it turns out that this simplified distance is always realized
by two vertices in $A$ and $B$ [6]. They also showed that with dynamic programming the discrete
Frechet distance between them can be computed in $O(|A||B|)$ time.

Recently, Jiang, Xu and Zhu applied the discrete Frechet distance in (globally) aligning the 
backbones of proteins (which is called the protein structure-structure alignment or more generally,
the protein global alignment problem) [Hj. In fact, in this application the discrete Frechet distance 
makes more sense as the backbone of a protein is simply a polygonal chain in 3D, with each vertex 
being the alpha-carbon atom of a residue. So if the (continuous) Frechet distance is realized by an 
alpha-carbon atom and some other point which does not represent an atom, it is not meaningful 
biologically. Jiang, et al. showed that given two 2D (or 3D) polygonal chains the minimum discrete 
Frechet distance between them, under both translation and rotation, can be computed in polynomial 
time. They also applied some ideas therein to design an efficient heuristic for the original protein 
structure-structure alignment problem in 3D and the empirical results showed that their alignment 
is more accurate compared with previously known solutions. 

In essence, the result of Jiang, Xu and Zhu [13] implies that the protein global alignment
problem, which is to find all proteins in a given set $\mathcal{P}$ similar to a query protein or some protein
in $\mathcal{P}$ (under translation and rotation), is polynomially solvable. However, very few
algorithmic/complexity results are known regarding the protein local structure alignment problem. The
only such recent result was due to Qian, et al. who showed that under the RMSD distance the 
problem is NP-complete but admits a PTAS [19]. On the other hand, there have been lots of 
experimental/heuristic methods with practical systems since 1989, e.g., SSAP [21], DALI [Tl] [TP], 
CATH P2|, CE [20], SCOP j5], MAMMOTH [H] and TALI [lj. In this paper, we show that if 
many proteins are given then the local structure alignment problem, under the discrete Frechet
distance, is very hard; on the other hand, if only a small number of proteins are given then there 
is a polynomial time solution for the problem. 

The paper is organized as follows. In Section 2, we introduce some basic definitions regarding 
Frechet distance and review some known results. In Section 3, we show the hardness result for the 
protein local structure alignment problem. In Section 4, we show how to solve the problem when 
m is a constant. In Section 5, we conclude the paper with several open problems. 

2 Preliminaries 

Given two 3D polygonal chains $A, B$ with $|A| = k$ and $|B| = l$ vertices respectively, we aim at
measuring the similarity of $A$ and $B$ (possibly under translation and rotation) such that their
distance is minimized under certain measure. Among the various distance measures, the Hausdorff 
distance is known to be better suited for matching two point sets than for matching two polygonal 
chains; the (continuous) Frechet distance is a superior measure for matching two polygonal chains, 
but it is not quite easy to compute pQ. 

Let $X$ be the Euclidean space $\mathbb{R}^3$; let $d(a, b)$ denote the Euclidean distance between two points
$a, b \in X$. The (continuous) Frechet distance between two parametric curves $f : [0, 1] \to X$ and
$g : [0, 1] \to X$ is

$$\delta_F(f, g) = \inf_{\alpha, \beta} \max_{s \in [0, 1]} d(f(\alpha(s)), g(\beta(s))),$$

where $\alpha$ and $\beta$ range over all continuous non-decreasing real functions with
$\alpha(0) = \beta(0) = 0$ and $\alpha(1) = \beta(1) = 1$.$^1$

Imagine that a person and a dog walk along two different paths while connected by a leash; 
moreover, they always move forward, possibly at different paces. Intuitively, the minimum possible 
length of the leash is the Frechet distance between the two paths. To compute the Frechet distance 
between two polygonal curves $A$ and $B$ (in the Euclidean plane) of $|A|$ and $|B|$ vertices, respectively,
Alt and Godau [1] presented an $O(|A||B| \log^2(|A||B|))$ time algorithm. Later this bound was
reduced to $O(|A||B| \log(|A||B|))$ time [2].

We now define the discrete Frechet distance following [6]. 

Definition 2.1 Given a polygonal chain (polyline) in 3D, $P = \langle p_1, \ldots, p_k \rangle$ of $k$ vertices, an
$m$-walk along $P$ partitions the path into $m$ disjoint non-empty subchains $\{\mathcal{P}_i\}_{i=1..m}$ such that
$\mathcal{P}_i = \langle p_{k_{i-1}+1}, \ldots, p_{k_i} \rangle$ and $0 = k_0 < k_1 < \cdots < k_m = k$.

Given two 3D polylines $A = \langle a_1, \ldots, a_k \rangle$ and $B = \langle b_1, \ldots, b_l \rangle$, a paired walk along $A$ and
$B$ is an $m$-walk $\{A_i\}_{i=1..m}$ along $A$ and an $m$-walk $\{B_i\}_{i=1..m}$ along $B$ for some $m$, such that, for
$1 \le i \le m$, either $|A_i| = 1$ or $|B_i| = 1$ (that is, $A_i$ or $B_i$ contains exactly one vertex). The cost of
a paired walk $W = \{(A_i, B_i)\}$ along two paths $A$ and $B$ is

$$d_F^W(A, B) = \max_i \max_{(a, b) \in A_i \times B_i} d(a, b).$$

The discrete Frechet distance between two polylines $A$ and $B$ is

$$d_F(A, B) = \min_W d_F^W(A, B).$$

The paired walk that achieves the discrete Frechet distance between two paths $A$ and $B$ is also called
the Frechet alignment of $A$ and $B$.

$^1$ This definition holds in any fixed dimensions.

Consider the scenario in which the person walks (jumps) along $A$ and the dog along $B$. Intuitively,
the definition of the paired walk is based on three cases:

1. $|B_i| > |A_i| = 1$: the person stays and the dog moves (jumps) forward;

2. $|A_i| > |B_i| = 1$: the person moves (jumps) forward and the dog stays;

3. $|A_i| = |B_i| = 1$: both the person and the dog move (jump) forward.
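The three cases translate directly into a table-filling procedure. The following is a minimal sketch of the standard quadratic-time dynamic program for the discrete Frechet distance, not code from the paper; the function name `discrete_frechet` and the representation of chains as lists of coordinate tuples are assumptions made for illustration.

```python
import math

def discrete_frechet(A, B):
    """Discrete Frechet distance between two point sequences A and B.

    ca[i][j] holds the distance restricted to the prefixes A[:i+1], B[:j+1];
    the table has |A| * |B| entries, each filled in constant time.
    """
    k, l = len(A), len(B)
    ca = [[0.0] * l for _ in range(k)]
    for i in range(k):
        for j in range(l):
            d = math.dist(A[i], B[j])
            if i == 0 and j == 0:
                ca[i][j] = d
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d)   # person stays, dog moves
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d)   # person moves, dog stays
            else:
                # cheapest of: person moves, dog moves, both move
                best_prev = min(ca[i - 1][j], ca[i][j - 1], ca[i - 1][j - 1])
                ca[i][j] = max(best_prev, d)
    return ca[k - 1][l - 1]

# Two parallel chains at unit distance (hypothetical example data).
A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
B = [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]
print(discrete_frechet(A, B))  # 1.0
```

Storing the full table keeps the sketch close to the three cases; only the previous row is needed if memory matters.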




Fig. 1. The relationship between discrete and continuous Frechet distances. 

Eiter and Mannila presented a simple dynamic programming algorithm to compute $d_F(A, B)$
in $O(|A||B|) = O(kl)$ time [6]. Recently, Jiang, et al. showed that the minimum discrete Frechet
distance between two chains in 2D, $A$ and $B$, under translation can be computed in $O(k^3 l^3 \log(k + l))$
time, and under both translation and rotation it can be computed in $O(k^4 l^4 \log(k + l))$ time [14].
For 3D chains these bounds are $O(k^4 l^4 \log(k + l))$ and $O(k^7 l^7 \log(k + l))$ respectively [14]. They are
significantly faster than the corresponding bounds for the continuous Frechet distance (certainly
due to a simpler distance structure), which are $O((k + l)^{11} \log(k + l))$ and $O((k + l)^{20} \log(k + l))$
respectively for 3D chains [22].

We comment that while the discrete Frechet distance could be arbitrarily larger than the corresponding
continuous Frechet distance (e.g., in Fig. 1 (I), they are $d(a_2, b_2)$ and $d(a_2, o)$ respectively),
by adding sample points on the polylines, one can easily obtain a close approximation of the continuous
Frechet distance using the discrete Frechet distance (e.g., one can use $d(a_2, b)$ in Fig. 1 (II)
to approximate $d(a_2, o)$). This fact was pointed out before in [BJCES] and is supported by the fact that
the segments in protein backbones are mostly of similar lengths. Moreover, the discrete Frechet 
distance is a more natural measure for matching the geometric shapes of biological sequences such 
as proteins. As we mentioned in the introduction, in such an application, continuous Frechet does 
not make much sense to biologists. 
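The effect of adding sample points can be illustrated with a small sketch. It is not taken from the paper: the helper name `densify`, the parameter `max_len`, and the example coordinates are assumptions, and the distance routine is the standard quadratic dynamic program.

```python
import math

def densify(chain, max_len):
    """Insert evenly spaced sample points so that no segment exceeds max_len."""
    out = [tuple(chain[0])]
    for p, q in zip(chain, chain[1:]):
        pieces = max(1, math.ceil(math.dist(p, q) / max_len))
        for s in range(1, pieces + 1):
            t = s / pieces
            out.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return out

def discrete_frechet(A, B):
    """Standard O(|A||B|) dynamic program for the discrete Frechet distance."""
    ca = [[0.0] * len(B) for _ in A]
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            prev = [ca[p][q] for p, q in ((i - 1, j), (i, j - 1), (i - 1, j - 1))
                    if p >= 0 and q >= 0]
            ca[i][j] = max(min(prev, default=0.0), math.dist(a, b))
    return ca[-1][-1]

# A is one long segment; B runs parallel to it at height 1 with a middle vertex.
A = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
B = [(0.0, 1.0, 0.0), (5.0, 1.0, 0.0), (10.0, 1.0, 0.0)]
coarse = discrete_frechet(A, B)               # middle vertex of B has no close partner
fine = discrete_frechet(densify(A, 5.0), B)   # midpoint of A added as a sample
print(coarse, fine)
```

In this example the continuous Frechet distance is 1, and the discrete value drops from $\sqrt{26}$ to 1 once the midpoint of $A$ is sampled.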

In the remaining part of this paper, for the first time, we investigate the problem of locally aligning
a set of polygonal chains (proteins or protein backbones) in 3D under the discrete Frechet distance.

3 Protein Local Structure Alignment is Hard 

Given a set of proteins modeled as simple 3D polygonal chains, the Protein Local Structure Alignment
(PLSA) problem is defined as follows.

Instance: Given a set of $m$ proteins $P_1, P_2, \ldots, P_m$ in 3D, each with length $O(n)$, and a real
number $D$.

Problem: Does there exist a chain $C$ of $k$ vertices such that the vertices of $C$ are from the $P_i$'s, and,
for every $1 \le i \le m$, $C$ and a subsequence of $P_i$ have discrete Frechet distance at most $D$
(under translation and rotation)?

If neither translation nor rotation is allowed, we call the corresponding problem static PLSA. For
the optimization version of the problem, we wish to maximize $k$ when $D$ is given. The (polynomial-time)
approximation solution will also be referred to as approximating the optimal solution value k* when 
it is hard to compute exactly. We will see that it is also hard to approximate k* even for static 
PLSA. We first prove the following theorem. 

Theorem 3.1 Given $D = \delta$, the static PLSA problem does not admit any approximation of factor
$n^{1-\epsilon}$ unless P=NP.

Proof. It is easy to see that PLSA belongs to NP. We use a reduction from Independent Set to 
the Protein Local Structure Alignment Problem. Independent Set is a well known NP-complete 
problem which cannot be approximated within a factor of $n^{1-\epsilon}$ [9]. The general idea is similar to
that of the longest common subsequence problem for multiple sequences [15], but our details are
much more involved due to the geometric properties of the problem. 

Given a graph $G = (V, E)$, $V = \{v_1, v_2, \ldots, v_N\}$, $E = \{e_1, e_2, \ldots, e_M\}$, we construct $M + 1$ 3D
chains $P_0, P_1, P_2, \ldots, P_M$ as follows. (We assume that the vertices and edges in $G$ are sorted by their
corresponding indices.)

The overall reduction is as follows: $\mathcal{P} = \{P_0, P_1, P_2, \ldots, P_M\}$, and

$$P_0 = \langle v'_1, v'_2, \ldots, v'_n \rangle,$$

where $v'_i = (i, i^2, 0)$ is a 3D point for $i = 1, \ldots, n$ (here $n = N$).

For each $e_r = (v_i, v_j)$ in $G$, we have a corresponding sequence (3D chain)

$$P_r = \langle v'_1, v'_2, \ldots, v'_{i-1}, v'_{i+1}, \ldots, v'_n, v''_1, v''_2, \ldots, v''_{j-1}, v''_{j+1}, \ldots, v''_n \rangle,$$

where $v'_i = (i, i^2, 0)$ and $v''_i = (i, i^2, \delta)$ are 3D points for $i = 1, \ldots, n$ and $\delta$ is an arbitrarily small
positive real number less than 0.1.
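The construction can be sketched in a few lines. The function name `build_chains`, the vertex numbering $1, \ldots, n$, the convention $i < j$ for each edge, and the concrete value of `delta` are assumptions made for illustration; the small path graph in the usage lines is hypothetical.

```python
def build_chains(n, edges, delta=0.0625):
    """Build the chains P_0, ..., P_M for a graph on vertices 1..n.

    edges is a list of pairs (i, j) with i < j; delta is any small positive
    number below 0.1 (0.0625 is an arbitrary illustrative choice).
    """
    def lower(i):   # v'_i, on the plane z = 0
        return (float(i), float(i * i), 0.0)

    def upper(i):   # v''_i, on the plane z = delta
        return (float(i), float(i * i), delta)

    chains = [[lower(p) for p in range(1, n + 1)]]                # P_0
    for i, j in edges:
        first = [lower(p) for p in range(1, n + 1) if p != i]     # v'_i omitted
        second = [upper(p) for p in range(1, n + 1) if p != j]    # v''_j omitted
        chains.append(first + second)                             # P_r
    return chains

# Hypothetical path graph 1 - 2 - 3.
chains = build_chains(3, [(1, 2), (2, 3)])
for r, chain in enumerate(chains):
    print(r, chain)
```

Every $P_r$ with $r \ge 1$ has $2n - 2$ vertices, so the whole instance is built in time proportional to its size.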

We claim that $G$ has an independent set of size $k$ if and only if there is a chain $C$ of $k$ vertices
such that, for each $r$, the discrete Frechet distance between $C$ and a subsequence $S_r$ of $P_r$ is at
most $\delta$ (i.e., $d_F(C, S_r) \le \delta$). The following claims are made with the detailed proofs left out.

Claim A. $P_r$ is a simple polygonal chain in 3D.

Claim B. $S_r$ is a simple polygonal chain in 3D with $|S_r| = k$.

If $G$ has an independent set of size $k$, then the chain $C$ can be constructed as follows. Let the
independent set of $G$ be ordered as $I = \langle v_{i_1}, v_{i_2}, \ldots, v_{i_k} \rangle$ with $i_1 < i_2 < \cdots < i_k$.
For $r = 0, 1, \ldots, M$, we scan $P_r$ in a greedy fashion to obtain the first $v'_j$ or $v''_j$ such that the
first component of its coordinate is $i_1$. Repeat this process for $i_2, \ldots, i_k$ to obtain $S_r$. Then let
any $S_r$ be $C$. Obviously, $C$ has $k$ vertices and $|S_r| = k$ for $r = 0, 1, \ldots, M$.
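The greedy scan can be sketched as follows. The function name `greedy_subsequence`, the use of `None` to signal failure, and the example chain (the chain that the construction would give for a hypothetical edge $(v_1, v_2)$ on three vertices, with an arbitrary $\delta = 0.0625$) are assumptions made for illustration.

```python
def greedy_subsequence(chain, indices):
    """Scan chain left to right; for each wanted index (in ascending order)
    pick the first not-yet-passed vertex whose first coordinate equals it.

    Returns the picked vertices, or None if some index cannot be matched
    without going backwards in the chain.
    """
    picked, pos = [], 0
    for idx in indices:
        while pos < len(chain) and chain[pos][0] != idx:
            pos += 1
        if pos == len(chain):
            return None
        picked.append(chain[pos])
        pos += 1
    return picked

delta = 0.0625
# Chain for the hypothetical edge (v_1, v_2): v'_1 and v''_2 are omitted.
P = [(2.0, 4.0, 0.0), (3.0, 9.0, 0.0), (1.0, 1.0, delta), (3.0, 9.0, delta)]

print(greedy_subsequence(P, [1, 3]))  # {v_1, v_3} is independent: scan succeeds
print(greedy_subsequence(P, [1, 2]))  # v_1 and v_2 are adjacent: scan fails
```

For the independent pair the scan returns $\langle v''_1, v''_3 \rangle$; for the adjacent pair it cannot continue past $v''_1$, which mirrors the ordering argument used in the converse direction of the proof.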

If there is a chain $C$ of $k$ vertices such that the discrete Frechet distance between $C$ and a
subsequence $S_r$ of $P_r$ is at most $\delta$ (i.e., $d_F(C, S_r) \le \delta$), then we can see the following.

Property (a) Let $P_r = \langle v'_1, v'_2, \ldots, v'_{i-1}, v'_{i+1}, \ldots, v'_n, v''_1, v''_2, \ldots, v''_{j-1}, v''_{j+1}, \ldots, v''_n \rangle$, then
$d(v'_p, v''_q) > 3$ for all $p \ne q$.

Property (b) Let $P_r = \langle v'_1, v'_2, \ldots, v'_{i-1}, v'_{i+1}, \ldots, v'_n, v''_1, v''_2, \ldots, v''_{j-1}, v''_{j+1}, \ldots, v''_n \rangle$, then
$d(v'_p, v''_p) \le \delta$ for all $p \ne i$, $p \ne j$.

Property (c) Let $P_r = \langle u_1, u_2, \ldots, u_{O(n)} \rangle$, then $|d(u_p, u_q) - d(u_{p'}, u_{q'})| \gg \delta$ as long as the first
components of the 4 coordinates of $u_p, u_q, u_{p'}, u_{q'}$ are all different.
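Properties (a) and (b) can be checked directly from the coordinates. The short calculation below is a sketch added for illustration; it uses only $v'_p = (p, p^2, 0)$ and $v''_q = (q, q^2, \delta)$ with integer indices in $\{1, \ldots, n\}$.

```latex
d(v'_p, v''_q)^2 = (p - q)^2 + (p^2 - q^2)^2 + \delta^2
                 = (p - q)^2 \bigl(1 + (p + q)^2\bigr) + \delta^2 .
% For p \neq q we have |p - q| \ge 1 and p + q \ge 3, hence
d(v'_p, v''_q)^2 \ge 1 \cdot (1 + 9) + \delta^2 > 10,
\qquad d(v'_p, v''_q) > \sqrt{10} > 3 .
% For p = q the first two terms vanish, so
d(v'_p, v''_p) = \delta .
```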

As $\delta$ is very small, when $d_F(C, S_r) \le \delta$, the vertices of $C$ and $S_r$ must be matched in order in
a one-to-one fashion. (In other words, the man walking on $C$ and the dog walking on $S_r$ must
move/jump together at each vertex. Otherwise, $d_F(C, S_r) > 3 \gg \delta$.) We now claim that the
(ordered) vertices of $C$ correspond to an independent set $I$ of $G$; moreover, if $C = \langle C_1, C_2, \ldots, C_k \rangle$
and $C_p = (x_p, y_p, z_p)$, then $v_{x_p} \in I$. Suppose that $C_p = (x_p, y_p, z_p)$, $C_q = (x_q, y_q, z_q)$ with $p < q$
and $v_{x_p}, v_{x_q} \in I$, but there is an edge $e_t = (v_{x_p}, v_{x_q}) \in E$. By our construction of $P_t$ (from $e_t$),
$v'_{x_p}$ and $v''_{x_q}$ are not included in $P_t$, so $C_p$ and $C_q$ can only be matched to $v''_{x_p}$ and $v'_{x_q}$
respectively, yet $v'_{x_q}$ precedes $v''_{x_p}$ in $P_t$. This is a contradiction.

To conclude the proof of this theorem, notice that the reduction takes $O(MN)$ time. □

In the example shown in Figure 1, we have

$P_1 = \langle v'_1, v'_3, v'_4, v'_5, v''_1, v''_2, v''_4, v''_5 \rangle$,

$P_2 = \langle v'_1, v'_3, v'_4, v'_5, v''_1, v''_2, v''_3, v''_5 \rangle$,

$P_3 = \langle v'_2, v'_3, v'_4, v'_5, v''_1, v''_3, v''_4, v''_5 \rangle$,

$P_4 = \langle v'_2, v'_3, v'_4, v'_5, v''_1, v''_2, v''_3, v''_5 \rangle$,

$P_5 = \langle v'_1, v'_2, v'_4, v'_5, v''_1, v''_2, v''_3, v''_5 \rangle$, and

$P_6 = \langle v'_1, v'_2, v'_3, v'_5, v''_1, v''_2, v''_3, v''_4 \rangle$.

Figure 1. Illustration of a simple graph for the reduction.



An example of $P_3$ is shown in Figure 1 as well, in which case black nodes are on the $Z = 0$ plane
and white nodes are on the $Z = \delta$ plane (for visualization purposes, the XY-plane
is slanted). The solid segments are on the $Z = 0$ plane, the dotted segments are on the $Z = \delta$
plane and the only dashed segment connects two points on different planes. Corresponding to the
optimal independent set $\{v_1, v_3, v_5\}$ in $G$, the optimal local alignment $C = \langle v'_1, v'_3, v'_5 \rangle$ matches $P_3$
at its subsequence $S_3 = \langle v''_1, v''_3, v''_5 \rangle$.

Corollary 3.1 Given $D = \delta$ and when both translation and rotation are allowed, the (maximization
version of) PLSA problem does not admit any approximation of factor $n^{1-\epsilon}$ unless P=NP.

Proof. Due to Properties (a), (b) and (c), translation/rotation will not be able to generate another
$C'$ which is topologically different from $C$. □

Notice that in our proof all the adjacent vertices in $C$ could be non-adjacent in $P_i$, for
$i = 0, 1, \ldots, m$. Biologically, this might be a problem as one residue alone sometimes cannot carry out
any biological function. Define a c-substring or a c-subchain of $P_i$ as a contiguous subchain of $P_i$
with at least $c$ vertices. Unfortunately, even if we introduce this condition by forcing that $C$ is
composed of $k$ ordered c-substrings of each $P_i$, for some constant $c$, the above proof can be modified
to maintain a valid reduction from Independent Set. Call this corresponding problem Protein
c-Local Structure Alignment (PcLSA), in which $C$ must be composed of $k$ ordered c-subchains of
each $P_i$. We have the following corollary.

Corollary 3.2 The maximization version of PcLSA does not admit any approximation of factor
$n^{1-\epsilon}$ unless P=NP.

4 Polynomial Time Solutions for PLSA When m is Small 

In this section, we present a polynomial time solution for the PLSA problem when m is a constant. 
We first show a dynamic programming solution for the static PLSA and then we show how to use
that as a subroutine for the general PLSA problem, when m is small.



4.1 A Dynamic Programming Solution for the Static PLSA When m is Small 

In this subsection, we present a dynamic programming solution for the static PLSA problem when 
m is small. Such a solution can be used as a subroutine for the general PLSA problem. We first 
consider the case when $m = 2$. Besides $C$, we try to maximize the length of the aligned subsequences
in $P_1 = A$ and $P_2 = B$ with $|A| = n_1$, $|B| = n_2$. For ease of description, we only show how to
obtain these lengths, which are stored in $D[-,-,-,-]$ and $M[-,-,-,-]$ respectively. It is easy to
reconstruct C from these arrays. 

Let $A[i_1, i_2]$ be a subchain of $A$ starting from the index $i_1$ and ending at the index $i_2$. Let
$B[j_1, j_2]$ be a subchain of $B$ starting from the index $j_1$ and ending at the index $j_2$. $D[i_1, i_2, j_1, j_2]$
stores the length of the aligned subsequences of $A[i_1, i_2]$ as a consequence of the alignment of $C$
and $A[i_1, i_2]$, and $C$ and $B[j_1, j_2]$. $M[i_1, i_2, j_1, j_2]$ is defined symmetrically.

Intuitively $D[-]$ stores the length of aligned subsequences from chain $A$ (dog's route)
and $M[-]$ stores the length of aligned subsequences from chain $B$ (man's route). Define
$T_F(i_1, i_2, j_1, j_2)$ as the sum of the lengths of the aligned subsequences in both $A[i_1, i_2]$ and
$B[j_1, j_2]$. Writing $A[i]$ as $a_i$ and $B[j]$ as $b_j$, we have the dynamic programming solution as follows.

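$$T_F(i_1, i_2, j_1, j_2) = D(i_1, i_2, j_1, j_2) + M(i_1, i_2, j_1, j_2),$$

where

$$D(i_1, i_2, j_1, j_2) = \max \begin{cases}
\max_{i_1 \le k_1 < i_2} \{D(i_1, k_1, j_1, j_2) + 1\} & \text{if } d(a_{i_2}, b_{j_2}) \le \delta, \quad \text{// dog moves} \\
\max_{i_1 \le k_1 < i_2,\ j_1 \le k_2 < j_2} \{D(i_1, k_1, j_1, k_2) + 1\} & \text{if } d(a_{i_2}, b_{j_2}) \le \delta, \quad \text{// both move} \\
\max_{j_1 \le k_2 < j_2} \{D(i_1, i_2, j_1, k_2)\} & \text{if } d(a_{i_2}, b_{j_2}) \le \delta, \quad \text{// dog stays}
\end{cases} \qquad (1)$$

and

$$M(i_1, i_2, j_1, j_2) = \max \begin{cases}
\max_{i_1 \le k_1 < i_2} \{M(i_1, k_1, j_1, j_2)\} & \text{if } d(a_{i_2}, b_{j_2}) \le \delta, \quad \text{// man stays} \\
\max_{i_1 \le k_1 < i_2,\ j_1 \le k_2 < j_2} \{M(i_1, k_1, j_1, k_2) + 1\} & \text{if } d(a_{i_2}, b_{j_2}) \le \delta, \quad \text{// both move} \\
\max_{j_1 \le k_2 < j_2} \{M(i_1, i_2, j_1, k_2) + 1\} & \text{if } d(a_{i_2}, b_{j_2}) \le \delta, \quad \text{// man moves}
\end{cases} \qquad (2)$$

The boundary cases are handled as follows.

$$D(i_1, i_1, j_1, j_1) = M(i_1, i_1, j_1, j_1) = \begin{cases}
1 & \text{if } d(a_{i_1}, b_{j_1}) \le \delta, \\
0 & \text{if } d(a_{i_1}, b_{j_1}) > \delta.
\end{cases} \qquad (3)$$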

The final solution value is stored in $T_F[1, n_1, 1, n_2]$. We have the following theorem.

Theorem 4.1 When $m = 2$, the static PLSA problem can be solved in $O(n^4)$ time and space.

It is easy to generalize this algorithm to the more general case when m is some constant. We 
thus have the following corollary. 

Corollary 4.1 When $m$ is a constant, the static PLSA problem can be solved in $O(m^3 n^{2m})$ time
and $O(m n^{2m})$ space.
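For intuition about the static case with $m = 2$, the following sketch solves a deliberately simplified variant and is not the four-index recurrence of this section. It assumes that vertices are matched one-to-one and that $C$ is taken to be the matched subsequence of $A$, so it returns a feasible alignment and hence a lower bound on $k^*$. The function name `one_to_one_local_alignment` and the example chains are hypothetical.

```python
import math

def one_to_one_local_alignment(A, B, delta):
    """Longest pair of subsequences of A and B matched one-to-one within delta.

    best[i][j] = largest number of matched pairs using A[:i] and B[:j]
    (a longest-common-subsequence style table, O(|A||B|) time and space).
    """
    n1, n2 = len(A), len(B)
    best = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            best[i][j] = max(best[i - 1][j], best[i][j - 1])
            if math.dist(A[i - 1], B[j - 1]) <= delta:
                best[i][j] = max(best[i][j], best[i - 1][j - 1] + 1)
    # Trace back one optimal list of matched index pairs.
    pairs, i, j = [], n1, n2
    while i > 0 and j > 0:
        if (math.dist(A[i - 1], B[j - 1]) <= delta
                and best[i][j] == best[i - 1][j - 1] + 1):
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j]:
            i -= 1
        else:
            j -= 1
    return best[n1][n2], pairs[::-1]

delta = 0.0625  # arbitrary small value below 0.1
A = [(1.0, 1.0, 0.0), (2.0, 4.0, 0.0), (3.0, 9.0, 0.0)]
B = [(2.0, 4.0, 0.0), (3.0, 9.0, 0.0), (1.0, 1.0, delta), (3.0, 9.0, delta)]
k, pairs = one_to_one_local_alignment(A, B, delta)
print(k, pairs)
```

Because each matched pair is within $\delta$ and the pairs are increasing in both chains, the selected subsequence of $A$ has discrete Frechet distance 0 to itself and at most $\delta$ to the selected subsequence of $B$.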

4.2 A Polynomial Time Solution for PLSA When m is Small 

Apparently, any solution for PLSA should allow translation and rotation. When $m = 2$
and when both translation and rotation are allowed, we can use a method similar to that in [13]
to compute the optimal local alignment with fixed $\delta$. The idea is as follows. Without loss of
generality, we assume that $A$ is static and we translate/rotate $B$ and let $\tau(B)$ be the copy of $B$
after some translation/rotation. Let $|A| = n_1$, $|B| = n_2$ and let $f$ be the number of degrees of freedom
for moving $B$. As we are in 3D and both translation and rotation are allowed, we have $f = 6$. We
can enumerate all possible configurations for $A$ and $\tau(B)$ to realize a discrete Frechet distance of
$\delta$. There are $O((n_1 n_2)^f) = O(n^{12})$ such configurations, following an argument similar
to [22] [14]. Then for each configuration, we can use Theorem 4.1 to obtain the optimal
local alignment, and finally we simply return the overall optimal solution.

Corollary 4.2 When $m = 2$ and when both translation and rotation are allowed, the PLSA problem
can be solved in $O(n^{16})$ time and $O(n^4)$ space.

We comment that when $m$ is larger, but still a constant, the above idea can be carried over
so that we will still be able to solve PLSA in polynomial time. It follows from [22] [14] that we
have $O(n^{mf}) = O(n^{6m})$ configurations between the $m$ chains. Then we can again use
Corollary 4.1 to obtain the optimal local alignment for each configuration. The overall complexity
would be $O(n^{6m} \times m^3 n^{2m}) = O(m^3 n^{8m})$ time and $O(m n^{2m})$ space. Certainly, such an algorithm
is only meaningful in theory.

Corollary 4.3 When $m$ is a constant and when both translation and rotation are allowed, the
PLSA problem can be solved in $O(m^3 n^{8m})$ time and $O(m n^{2m})$ space.
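As a quick consistency check of these bounds (a worked instance added for illustration, with $f = 6$ degrees of freedom as in the text), the general formula can be specialized to small values of $m$:

```latex
% m = 2: number of configurations times the cost of the static dynamic program
O\bigl((n_1 n_2)^{6}\bigr) \cdot O(n^{4}) = O(n^{12}) \cdot O(n^{4}) = O(n^{16}),
% which agrees with the general bound, since m^3 = 8 is a constant:
O(m^{3} n^{8m})\big|_{m=2} = O(8\, n^{16}) = O(n^{16}).
% m = 3: the same bound already gives
O(m^{3} n^{8m})\big|_{m=3} = O(27\, n^{24}) = O(n^{24}).
```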

5 Concluding Remarks 

In this paper, for the first time, we study the complexity/algorithmic aspects of the famous protein 
local structure alignment problem under the discrete Frechet distance. We show that the general 
problem is NP-complete; in fact, it is even NP-hard to approximate within a factor of $n^{1-\epsilon}$. On
the other hand, when a constant number of proteins are given then the problem can be solved in
polynomial time. It would be interesting to see the empirical comparisons of protein local structure
alignment under the discrete Frechet distance with the existing methods. Another open problem, 
obviously, is whether it is possible to improve the running time of the dynamic programming 
algorithms in Section 4.